Anharmonicity and self-similarity of the free energy landscape of protein G 
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The near-native free energy landscape of protein G is investigated through 0.4^is-long atomistic 
molecular dynamics simulations in explicit solvent. A theoretical and computational framework is 
used to assess the time-dependence of salient thermodynamical features. While the quasi-harmonic 
character of the free energy is found to degrade in a few ns, the slow modes display a very mild 
| dependence on the trajectory duration. This property originates from a striking self-similarity of the 

free energy landscape embodied by the consistency of the principal directions of the local minima, 
^ 1 where the system dwells for several ns, and of the virtual jumps connecting them. 
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The multiscale protein dynamics observed in single-molecule experiments 0, Q provides a vivid reverberation of the 
complexity of the free energy surface of folded proteins 0, |3] ■ Complementing these phenomenological observations 
with detailed theoretical and computational investigations has proved challenging. For instance, most of the atomistic 
molecular dynamics (MD) explorations of the free-energy landscape organization have probed features with a time 
resolution below the ns scale. Though limited, this time span is sufficient to reveal the presence, within the native 
basin, of a plethora of local energy minima corresponding to minute structural differences 0, @, 0|- Within the 
short time spent in each of these minima, about 1 ps 6], the normal modes analysis based on the local quadratic 
approximation of the energy adequately captures the system dynamics [§[ . It is surprising, however, that the 10-20 
slowest normal modes of any of these minima are sufficient to describe the large scale conformational changes occurring 
over tens of ns or more e.g. switches between active/inactive enzymatic forms etc. This remarkable property, 

exhibited by generic globular proteins indicates a special multiscale organization of the free energy landscape. In 
this study we dissect, by quantitative means, this organization for a particular globular biopolymer, protein G. In 
particular, we analyse the properties of the phase space explored by atomistic molecular dynamics simulations covering 
as much as 0.4 ^s. This represents an increase of more than two orders of magnitude in duration with respect to 
£N1 . previous related investigations and hence allows an unprecedented insight into the organization of the free energy 
surface @, 0]. The latter is shown to comprise minima whose self-similarity extends, unexpectedly, to the largest 
examined time scales. The observed self-similarity is of a peculiar kind as it pertains only to the principal directions 
of the minima which otherwise differ significantly in depth and breadth. The findings provide a novel insight on the 
robustness of the large-scale concerted movements in proteins obtained either through principal component analysis 
of MD simulations or from simplified mesoscopic models. 

We first consider the assumption, at the basis of several elastic network approaches 1(J II |, that the free energy 
• governing the near-native dynamics of a protein has a quadratic character: 

T = Mij tllv Sr^ Srj.y (1) 

; i,3,H,v 

where 5ri tll indicates the //th Cartesian component of the vector displacement of the ith amino acid (represented by 
the C Q atom) from its native reference position. Despite the minimalistic character, harmonic models based on eq. 
((T|) have proved capable of capturing large-scale movements occurring over time scales that vastly exceed those of 
expected validity for local quadratic expansions of the free energy 0, Q . Prompted by these results we investigate in 
detail the extent to which extensive atomistic MD simulations support the simple quadratic form of eqn. |T]) . First we 
shall recover the matrix of effective couplings, M, that best describes one or more constant-temperature dynamical 
trajectories (from which roto-translations are previously removed by optimally superimposing each recorded system 
configuration to a given reference structure.) The effective matrix M associated to an equilibrated trajectory of 
duration At will be obtained from the covariance matrix, C [12t [13| : 

C ijtfiV (At) = ([r^(t) - (r^)] [r jiV (t) ~ (n, v )]) (2) 

where fi (t) is the position of the ith amino acid at time t and the brackets denote the time-average over a specific time 
interval of duration At. Within the quadratic approximation, the coupling matrix is given by M( At) = C -1 ( At) / k b T, 
where kbT is the thermal energy and the tilde indicates the removal of the six roto-translational zero eigenspaces of 
C prior to inversion. 
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FIG. 1: [Color online] (a) Ribbon representation of protein G. (b) Color-coded plot of the Kendall's correlation coefficient, r, 
between corresponding elements of the coupling matrices, M, calculated over the first At and At' ns of the first trajectory. 
Matrix elements along the diagonal or pertaining to consecutive C a 's were omitted in the calculation of r. 

Next, by considering longer and longer trajectories, we shall examine how M depends on the breadth of the visited 
phase space. The consistency of two M matrices will be assessed by comparing both the corresponding entries and their 
ten slowest modes. The latter correspond to the eigenvectors of M associated to the smallest non-zero eigenvalues. 
The similarity of two sets of slow modes, {v} and {w}, is measured, as customary, in terms of the root mean square 
inner product (RMSIP): 



This analysis is undertaken for protein G, an important signalling biomolecule shown in Fig. [T^l. It possesses a 
non-trivial a//3 tertiary organization despite its moderate length, 56 residues, which makes it amenable to extensive 
MD simulations. The 1PGB crystal structure was taken as the starting point of 4 MD runs in explicit solvent each 
of 100 ns. First the system was energy minimized after solvation with the single-point charge water model in an 
octahedral box containing an explicit solvent layer of 1.2 nm. The density was adjusted by a lOOps-long MD in 
NPT conditions by weak coupling to a bath of constant pressure (P = 1 bar, coupling time t=0.5 ps). Subsequently 
the four different 100-ns long trajectories were started with different sets of Maxwellian (T=300 K) initial atomic 
velocities and integrated with the GROMACS engine[3] The particle mesh Ewald method was used for electrostatics 
and thermostatting was provided by a thermal bath (coupling time r=0.1ps). To remove correlations due to the 
same starting structure we omitted the initial 10 ns from the analysis. The energy-mininised initial structure is, 
on the other hand, used as the common reference configuration to roto-translate, and hence align in space, all four 
trajectories. From the remaining 90 ns long trajectories, intervals of increasing duration At are used to compute the 
coupling matrices M(At) from inversion of the covariance ones. 

Pairs of M matrices from the different runs and/or for increasing At are first compared by measuring the Kendall's 
correlation coefficient [l5| . r, of the ~13000 corresponding distinct entries. We employ Kendall's r as it provides a 
robust measure of data association with no prior assumption of the parametric dependence of the examined data sets. 
To convey the degree of consistency of M upon extending the length of the simulation we portray in Fig. [TJd the 
Kendall's r pertaining to pairs of intervals of different durations from the first trajectory. As visible in the figure, 
extending the duration of the trajectory, be it 1 or 16 ns, by four or more times leads to a substantial deviation of 
corresponding entries of M (with r ~ 0.30). Consistently, halving each trajectory and considering the correlation 
among the two halves gives values of r between 0.17 and 0.25. The consistency of M matrices of different trajectories 
is, furthermore, much poorer and almost independent on the compared time spans. In fact, pairwisc comparisons of 
different runs over the first ns or over the entire 90-ns duration trajectories yield values of r in the [0.05- 0.15] range. 

These values indicate a substantial degree of heterogeneity in corresponding matrix entries and hence point to 
the impossibility of having a robust definition of M even over hundreds of ns. An analogous loss of consistency is 
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FIG. 2: Normalised distribution of the projection along the first slow mode of the conformational fluctuations encountered in 
intervals of increasing length from the first trajectory. 

observed in the covariance matrices, with t ~ 0.15 for pairwise comparions of corresponding C entries calculated 
over the entire trajectories. Yet, despite the fact that no asymptotic value appears to be reached by the M (or C) 
entries, the latter display general properties which are robust against At and hence may be taken into account to 
improve elastic network models based on the a priori quadratic approximation of eq. [1] For example, consistently 
with previous findings based on the Hessian analysis [lq |. consecutive C Q 's are coupled by interactions that are more 
than 10 times stronger than non-consecutive ones. Furthermore, in all trajectories, negligible pairwise couplings are 
typically found among centroids at a distance larger than 8 — 9 A. Interestingly, below this cutoff, pairwise interactions 
can have either attractive or repulsive character (yet all eigenvalues of M are clearly non-negative). 

We shall now address the origin of the apparent lack of robustness of M by considering, in place of the individual 
matrix entries, the time dependence of the slow modes. For a system governed by a purely quadratic free energy, the 
fluctuations around the average structure have a Gaussian probability distribution along any of the eigenvectors of 
M. The width of the Gaussian is largest for the slowest mode which, corresponding to the direction of least curvature 
of J-, mostly accounts for the system fluctuations in thermal equilibrium. A valuable insight into the effective free 
energy landscape described by M(Ai) is hence obtained by considering the distribution of the projections along the 
slowest mode of the conformational fluctuations in trajectories of increasing duration. Typical results are shown in 
Fig. O It is noticed that, for At = 1 ns, the distributions have a unimodal character with a fair degree of Gaussianity. 
In fact, for all four trajectories, the normalised kurtosis k = ((x 4 ) — 3(x 2 ) 2 )/ (x 4 ) (x being the projection) typically has 
values of 0.15 for At up to 2 ns. For time spans larger than 5-10 ns deviations from Gaussianity become very apparent 
through the non-unimodal character of the distribution. The analysis of the kurtosis in the fluctuations of individual 
amino acids reveals that the largest deviations are observed for exposed regions, particularly residues 30-40. Also, the 
eigenvalue of M associated to the slowest mode decreases as At is increased, thus suggesting a progressive weakening 
of the quadratic potential upon the widening of the explored phase space. This softening can be vividly illustrated 
from a novel perspective by means of a simplified, but transparent, dynamical variational approach. More precisely, 
we shall consider the projection of a trajectory along its slowest mode i?iand attempt to describe its time-varying 
amplitude with a deterministic harmonic vibration of the amino acids: 

r=™(t) =jfi(t = 0) + av 1A [cos(ujt + cj>) - cos(0)] (4) 

where the superscript m is used to denote the coordinates of the model oscillators which coincide with those of 
the real system (superscript 0) at t = 0. The quantities uj, <f> and a are obtained from a dynamical variational 
criterion [17] . namely by minimizing the time-averaged total square deviation of the model trajectory from the real 
one, (Y^i \x" l (t) — x®(t)\ 2 ). As visible in Fig. O for time spans of up to fractions of a ns the oscillator is able to account 
satisfactorily for the evolution of the true trajectory (manifestly overdamped over At > 0.5 ns). The lower panel of 
Fig. [31 presenting a visible decrease of the optimal frequency u, illustrates how rapidly the curvature of the effective 
quadratic free energy decreases as a function of At. The softening reflects the complexity and anharmonicity of the 
free energy landscape which, as confirmed by the multimodal character of the distributions of Fig. [2[ is constituted 
by broad minima of varying depth which are progressively explored as the dynamics advances. 

The properties of the most pronounced minima have been examined in detail and lead to discovering that the 
self-similarity of the free energy landscape, previously ascertained with a sub-ns time resolution extends to an 
unsuspectedly large scale. To characterize the salient minima we first performed a cluster analysis of the structures 
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FIG. 3: Top: time evolution of the projection along the 1st eigenvector and of the variational deterministic oscillator for 
two intervals of different duration. Bottom: average value of the variational oscillator frequency as a function of the interval 
duration, At. 



generated in all four trajectories [1 81 ] and considered the 10 most populated sets. Each cluster, containing conforma- 
tions at less than 1.5 A RMSD from the representative, gathers configurations from time- intervals of very different 
length, from 1 to tens of ns, originating from one to three of the four trajectories. By carrying out a structural 
covariance analysis (with the time average of eqn. [2] being replaced with an average over cluster members) we iden- 
tified the 10 principal components describing the largest conformational changes in each cluster. By means of the 
RMSIP we finally compared the 10 principal components of all distinct pairs of clusters. The resulting distribution 
of RMSIP values indicated a very strong consistency of the sets of principal directions, see Fig. [4Jl Indeed, for the 
considered protein length, numerical results indicate that if the set of principal components were completely unrelated, 
the expected RMSIP value would be 0.24 ± 0.02. The distribution of values in Fig. [5] is sufficiently distant from this 
random reference value to convey the significance of the observed consistency of the principal components of different 
clusters. Strikingly, it was also found that the difference vectors, dij connecting the representatives of any pair of 
different clusters i and j are also well described by the principal components of any of the clusters. For example, as 
shown in Fig. [4Jd, the top 10 principal components of the largest cluster are usually sufficient to account for most 
of the norm of the "virtual jumps" connecting the representatives of any two top clusters. These facts reveal an 
appealing self-similar structure of the free-energy profile: not only the deep free-energy minima corresponding to the 
main clusters have similar principal components, but also the virtual "jumps" connecting their representatives are 
describable in the same low-dimensional space. This remarkable property can be formalised by using a generalization 
of the covariance decomposition of ref. [7(. In fact, for a generic MD trajectory that has visited several free-energy 
minima (clusters) the covariance matrix reflects the structural heterogeneity within and across the clusters 0] • More 
precisely: 



C = Wi{C l + y^ j w m w k |dj,fc)(d;, m |} 



(5) 



where wi is the weight of the Ith. cluster, that is the fraction of time spent by the system in it, C l is the covariance 
matrix of the Zth cluster, and d^ m is the distance vector of the representative (average) structures of clusters / and m. 
We now build on the previous observation that a single set of principal components Ui v n can well describe both the 
essential spaces of any cluster I and also the inter-cluster distance vectors: C l = KiWn}(v n \ and d^ m = J2 n c n m ^n- 
According to this approximation, the principal eigenvectors of C, and hence the slow modes of M, would coincide with 
wi,... v n and thus remain unchanged over all time scales. On the contrary, the associated eigenvalues would explicitly 
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FIG. 4: (a) Histogram of the RMSIP values calculated for the principal components of all pairs of top structural clusters, 
(b) Histogram of the norm of the projection of all pairwise distance vectors between cluster representatives along the top 10 
principal components of the largest cluster. 
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FIG. 5: RMSIP between the top essential dynamical spaces of the 1st ns of trajectory 1 and intervals of longer duration, At, 
from the same and other trajectories. 



depend on the MD duration through the time-dependent number and weight of the visited clusters. To verify these 
conclusions we have compared the essential spaces of different intervals from the simulated trajectories. The results 
are illustrated in Fig. [5] which portrays the RMSIP calculated between the essential spaces of the 1st ns for trajectory 
1 with larger and larger time spans for the same and other trajectories. It is seen that the top slowest modes are very 
robust against increasing At and remain consistent even augmenting the simulation time by two orders of magnitude 
(from 1 to 90 ns). The statistical significance of this result is highlighted by the difference of RMSIP ranges in Fig. 
[5] from the aforementioned random reference value of 0.24. As a further comparison we also considered the RMSIP 
value calculated over all pairs of 10 mid-ranking eigenvectors of the last 1 ns of all four trajectories. Also this more 
stringent test indicates that the RMSIP 's in Fig. [5] exceed the control value by at least 4 standard deviations and 
hence have a high statistical significance. 

In summary, the analysis of the dynamics of protein G, a typical globular protein, revealed an unexpectedly 
simple self-similar structure of the various free energy minima and of the virtual jumps connecting them. This 
remarkable feature reflects into the exceptional robustness of the essential dynamical spaces (slow modes) calculated 
over trajectories with very different duration. Instead, the typical amplitudes projected along the slowest modes 
depends on the number of visited minima as well as on their depth. As a result, the dynamical projections have 
a strong dependence on the duration of the simulation, a fact that accounts for the observed inconsistency of the 
coupling (or covariance) matrix entries. The observed properties, besides elucidating general features of the free 
energy landscape of one particular protein, have important practical ramifications. In particular they provide a first 
perspective, for understanding the scope and viability of coarse-grained elastic network models as well as short MD 
simulations. Accordingly, it is expected that the directionality of the slow modes/essential dynamical spaces can be 
determined with considerable more confidence than the amplitude of the associated dynamical projections. These 
considerations provide a strong motivation for investigating the validity and transferability of the present analysis to 
other protein contexts. 
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